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1.  Introduction 

We  propose  a  few  implementations  of  the  Alternating  Direction  Method  for  solving  parabolic 
partial  differential  equations  on  multiprocessors.  A  careful  complexity  analysis  of  these  implemen¬ 
tations  shows  that,  contrary  to  what  is  generally  believed,  the  method  can  be  made  highly  efficient 
on  parallel  architectures  by  using  pipelining  and  variations  of  the  classical  Gaussian  elimination 
algorithm  for  solving  tridiagonal  systems.  In  [15]  we  showed  that  we  could  obtain  linear  speedups 
for  moderate  numbers  of  processors  in  a  ring  architecture.  In  this  paper  we  discuss  extensions  to 
a  large  number  of  processors  in  a  2-D  grid  architecture  and  a  hypercube. 


2.  The  Alternating  Direction  Method  and  its  parallel  implementations 
We  consider  the  partial  differential  equation: 

5u\ 


du  d  (  .  .  e)u\  d  'du\ 

at  -  Tx  )  +  Ty 


on  the  domain  (x,  y,  t)  6  fl  x  [0,  T]  =  [0, 1]  x  [0, 1]  x  [0,  T],  with  the  initial  and  boundary  conditions: 

u(x,y,0)  =  u0(x,y),  V(x,y)  E  fl, 
u(i,  y,  t )  =  g {£ ,  y, /),  V(i, y)  €  dfl,  t> 0, 

where  dCl  is  the  boundary  of  the  unit  square  Cl.  After  the  description  and  analysis  of  algorithms 
for  the  two-dimensional  problem  we  generalize  the  results  to  higher  dimensional  domains. 

A  common  approach  to  solve  the  above  problem  is  the  Alternating  Direction  Method.  First 
the  equations  are  discretized  with  respect  to  the  space  variables  x  and  y  using  a  mesh  of  n  interior 
points  in  each  direction.  The  result  is  the  system  of  Ar  &  n2  ordinary  differential  equations: 


du  ■ 

-gj  =  Ax  w  +  B¥u 


(2.1) 


in  which  the  matrices  Az  and  Bt  represent  the  3-point  central  difference  approximations  to  the 
operators  &(<*(*, y)£))  and  £(&(x,y)£))  respectively. 

The  Alternating  Direction  Method  algorithm  consists  in  stepping  (2.1)  forward  in  time  alter¬ 
nately  in  the  x  and  y  directions  as  follows  [21]: 


(/-£A1AxK+U(/+1a<Bi,)u’ 
(I-±±tD,)u*'  =  (I+±±tA,)vi+l. 


(2.2) 

(2.3) 


We  observe  that  if  the  mesh  poiuts  are  ordered  by  lines  in  the  x  direction,  then  (2.2)  constitutes 
a  set  of  n  independent  tridiagonal  systems  whose  solution  is  perfectly  parallellizable.  However, 
(2.3)  constitutes  one  tridiagonal  system  in  which  the  uonzero  diagonals  are  the  main  diagonal  and 
the  (n  +  l)*1  super-  and  sub-diagonals.  It  is  important  to  note  that  this  second  system  can  also 
be  recast  into  a  set  of  n  independent  tridiagoual  systems  by  reordering  the  grid  points  by  lines, 
this  time  in  the  y  direction.  This  essentially  amounts  to  transposing  the  matrix  representing  the 
solution  at  the  n'x  n  grid  points  and  is  an  expensive  data  permutation  operation  which  is  often 
cited  as  the  main  drawback  of  Alternatiug  Direction  Method  in  regard  to  its  implementation  on 
parallel  machines.  The  other  difficulty  that  has  been  traditionally  associated  with  parallelizing  the 


Alternating  Direction  Method  is  that  the  classical  algorithms  for  solving  tridiagonal  systems  are 
sequential  in  nature.  The  arithmetic  complexity  of  the  Alternating  Direction  Method  on  an  idealized 
parallel  architecture  without  contention  for  storage  or  access  conflicts  has  been  investigated  by 
Sameh  et  al.  [18].  Various  parallel  implementations  of  the  Alternate  Direction  Method  have  been 
discussed  by  Gannon  and  Van  Rosendale  [3]  Lambiotte  [9]  and  Ortega  and  Voigt  [l  l). 

The  main  purpose  of  this  paper  is  to  describe  some  data  structures  and  algorithms  that  effi¬ 
ciently  use  some  candidate  multiprocessor  configurations  for  future  parallel  computer  systems.  We 
limit  the  discussion  to  the  solution  of  the  two  tridiagonal  systems  (2.2)  and  (2.3).  The  commu¬ 
nication  required  for  the  computation  of  the  right  hand  side  is  almost  identical  to  that  required 
for  the  tridiagonal  system,  and  the  arithmetic  complexity  is  approximately  half  of  that  for  the 
tridiagonal  system.  Since  the  choice  of  data  structure  and  algorithm  is  unaffected  by  the  inclusion 
of  the  computations  for  the  right  hand  side  we  omit  them.  On  a  single  processor  each  half  step 
costs  5n2  operations  for  the  forward  and  3n2  operations  for  the  backward  sweep.  Assuming  that 
s  arithmetic  operations  can  be  done  per  second,  the  time  for  a  half  step  on  a  single  processor  is 
approximately 

r,  =  !£.  (2.4) 

Throughout  the  paper  it  is  assumed  that  communication  and  arithmetic  computations  cannot 
take  place  concurrently.  This  assumption  is  essentially  made  for  convenience  and  simplicity  of  our 
analysis  and  does  not  affect  the  qualitative  aspects  of  the  conclusion. 

3.  Alternating  Direction  Methods  on  the  shared  memory  model  and  the  broadcast  bus  model 

In  this  section,  we  assume  that  k  processors  are  connected  to  one  large  shared  memory  with 
total  bandwidth  B  words  per  second  and  start  up  r.  Moreover,  we  assume  that  the  I/O  bandwidth 
of  each  processor  is  b.  Thus  B/b  different  processors  can  simultaneously  access  the  shared  memory. 

The  transfer  of  data  between  two  processors  is  done  by  the  first  processor  writing  to  the  shared 
memory  and  then  the  second  processor  reading  from  the  shared  memory.  Consider  a  method 
consisting  in  solving  n/k  similar  tridiagonal  systems  in  each  processor  for  (2.2)  then  transposing 
the  data  and  solving  (2.3)  as  n/k  tridiagonal  systems  in  each  processor  (it  is  assumed  that  k  <  n). 
The  total  time  required  for  performing  the  arithmetic  operations  of  one  half  step  is  approximately 


Performing  a  transpose  of  the  u  x  n  matrix  of  the  unknown  u  requires  the  communication 
time: 

Ts,Co,„m  *  Ijqi  (t  + 

b  n2 

*  2k— t  +  2  —  . 

B  B 

Hence  in  the  shared  memory  model,  it  takes  a  total  time  of 

b  »2  8«2 

rs*2,'B,  +  2B  +  T7 

to  perform  one  half  step  of  the  Alternating  Direction  Method. 

Assume  now  that  all  k  processors  are  linked  to  each  other  via  a  global  bus  which  allows  for 
broadcasting.  The  time  to  perform  the  arithmetic  operations  is  the  same  as  for  the  shared  memory 
model. 
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Figure  1:  Domain  decomposition  and  assignment  of  the  unit 
square  into  4  processors. 
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To  transpose  the  matrix,  each  processor  broadcasts  in  turn  its  data  to  all  the  other  processors. 
This  requires  a  total  communication  time  of 


.  »2 .  n2 

Tfl.Comm  «  k\T  +  £jj)  —  +  ~ 


Hence  the  total  time 


n*  8n2 


i 


TB*tr+-r  +  -r- 

b  ks 


for  performing  one  half  step  of  the  Alternating  Direction  Method. 

Observe  that  neither  of  the  above  two  models  achieves  a  reasonable  speed  up  when  k  increases 
because  of  the  limiting  term  n2  in  the  communcation  cost.  This  is  the  source  of  the  belief  that 
Alternating  Direction  Methods  do  not  parallelize  well. 


4.  Alternating  Direction  Methods  on  a  multiprocessor  ring 

In  the  ring  architecture  the  k  processors  are  arranged  in  a  ring  and  each  processor  can  commu¬ 
nicate  with  its  two  nearest  neighbors.  It  is  assumed  throughout  that  the  processors  are  numbered 
so  that  processor  numbered  i  is  connected  to  processors  numbered  i  +  1  and  i  —  1  modulo  k.  In 
order  to  efficiently  implement  the  algorithm,  we  will  first  assign  the  grid  points  such  as  to  minimize 
data  communication  costs.  The  assignment,  which  is  described  in  Figure  1  for  a  ring  of  Jt  =  4 
processors,  will  essentially  avoid  transposing  the  data  at  each  half  step. 

This  assignment  results  in  each  of  the  tridiagonal  systems  (2.2),  being  split  into  k  equal  parts, 
one  part  per  processor.  We  consider  the  cost  of  the  half  step  involving  the  sweep  in  the  x-direction. 
Looking  at  the  leftmost  column  of  subsquares  in  Figure  1,  each  of  the  k  processors  of  that  column 
can  start  performing  the  forward  sweeps  simultaneously  on  the  n/k  tridiagonal  systems  that  it 
holds.  Thus,  the  sweep  eliminates  the  variables  horizontally  from  left  to  right.  When  the  boundary 
of  the  first  subsquare  is  reached,  each  processor  sends  to  its  right  neighbor  in  the  figure  the  data 
that  is  necessary'  for  that  processor  to  continue  its  forward  sweep  on  its  part  of  its  tridiagonal 
systems,  i.e.,  the  processor  numbered  j  sends  to  its  neighbor  numbered  (j  +  1)  Mod  k  data  for  the 
I)|+l  to  jf  tridiagonal  systems.  The  forward  sweep  can  now  be  pursued  on  the  second  column 
of  .subsquares,  and  this  process  is  repeated  until  the  forward  sweep  is  completed  on  the  variables 
of  the  last  column  of  subsquares.  The  backward  sweep  is  accomplished  in  a  similar  fashion  except 
that  we  proceed  from  right  to  left.  The  half  step  in  the  y  direction  is  performed  similarly,  with  the 
forward  sweep  proceeding  from  bottom  to  top  and  the  backward  sweep  from  top  to  bottom.  We 
observe  that  no  processor  is  idle  at  any  time.  It  is  clear  that  the  time  to  perform  the  arithmetic 
operations  is  again  of  the  form: 

it  8  n  8  n2 
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Figure  2:  Assignment  of  the  subdomains  to  the  16  processors 
of  a  4  x  4  grid 

When  we  cross  an  interface  in  the  forward  sweep,  we  must  transfer  the  pivot  row  plus  the 
corresponding  element  of  the  right  hand  side  of  each  of  the  n  tridiagonal  systems  in  each  processor 
numbered  j  to  the  processor  numbered  (j + 1)  Mod  k.  These  I/O  operations  can  be  done  in  parallel 
for  each  of  the  processors  on  the  same  column.  A  total  of  k  —  1  interfaces  must  be  crossed.  For  the 
backward  sweep,  we  transfer  only  one  element  of  the  current  solution  per  system.  Therefore: 

TR,comm  =  (*  “  *)  (T  +  ^  (r  +  U)  *  2^k  ~  +  T' 

Thus,  the  total  execution  time  comes  to 

r*(/.)  =  2(*-l)r+£  +  5£. 

Clearly,  the  time  for  solving  in  the  other  direction  is  identical. 

We  observe  that  with  increasing  k  the  arithmetic  time  decreases  while  the  communication  time 
increases  linearly.  The  function  Tr(&)  has  a  minimum  which  is  achieved  for 


and  the  corresponding  optimal  time  is  linear  in  n: 

2^+£)«. 

Thus  by  an  appropriate  choice  of  algorithm  on  the  ring,  with  O(n)  processors  we  can  achieve  an 
0(n)  speed-up  contradicting  the  conventional  wisdom. 

5.  Alternating  Direction  Method  on  a  square  grid  of  processors 

In  this  section  we  consider  a  square  two-dimensional  grid  of  processors.  We  assign  the  sub¬ 
squares  defined  in  the  previous  section  naturally  onto  the  grid,  as  shown  in  Figure  2  in  which  the 
numbers  in  each  square  indicate  the  processor  number  to  which  the  subsquare  is  assigned.  It  is 
convenient  to  define  k  s  \/k. 
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This  assignment  resembles  the  one  of  the  previous  section  but  each  subsquare  is  now  handled  by 
a  different  processor.  If  we  proceeded  as  we  did  for  the  ring,  we  would  have  many  processors  inactive 
at  all  times.  In  fact  at  any  given  time  we  would  have  only  k  processors  working  simultaneously; 
namely  all  those  holding  the  same  column  of  subsquares.  To  avoid  this  we  can  think  of  pipelining 
the  computation:  the  forward  sweep  is  started  with  only  one  tridiagonal  system  in  each  of  the 
processors  in  the  first  column  of  subsquares  (Processors  1,  5,  9  and  13  in  the  illustration  of  Figure 
2).  As  soon  as  these  processors  are  finished  with  the  first  h/k  steps  of  Gaussian  elimination  of 
their  first  tridiagonal  systems  they  start  the  forward  sweep  on  the  k  second  tridiagonal  systems, 
while  processors  to  their  right  in  the  grid  start  working  on  their  part  of  the  first  tridiagonal  system. 
Thus,  after  /c  —  1  tridiagonal  systems  are  solved,  all  processors  will  be  active.  It  is  easy  to  see 
that  the  computation  consists  of  forward  sweeps  on  a  total  of  («/«)  +  k  —  1  tridiagonal  systems 
of  size  n//c  each:  n/tc  to  complete  all  the  first  parts  in  each  of  the  subsquares  of  the  first  column 
of  processors,  and  then  k  —  1  similar  eliminations  to  complete  the  parts  through  the  last  column. 
Moreover,  these  are  non-simultaneous  eliminations.  Each  of  the  partial  eliminations  is  followed  by 
a  transfer  of  the  pivot  row,  i.e.  3  elements.  The  backward  sweep  is  similar.  We  find  a  total  time  of 

,n  ,,  L  4  ,  8n 

tGO  —  ( — k  —  1)  2r  +  -  +  — 

K  OKS 

For  k?  =  k  <  n,  a  nearly  linear  speed-up  over  the  run  time  for  a  single  processor  is  achieved 
and  the  optimal  time  is  linear  in  n.  While  this  is  asymptotically  equivalent  to  the  result  for 
the  ring  of  processors,  it  requires  an  order  of  magnitude  less  time  for  just  the  data  movement. 
Unfortunately,  the  above  time  cannot  be  less  than  linear  in  n,  no  matter  how  we  choose  k,  for 
example  if  n  <  k  <  n2  i.e.,  it  is  not  possible  to  significantly  outperform  the  ring  architecture  for 
implementing  the  Alternating  Direction  Method  with  this  algorithm.  The  reason  for  this  is  that  we 
are  solving  tridiagonal  systems  with  the  classical  Gaussian  elimination  algorithm.  With  Gaussian 
elimination,  each  tridiagonal  system  cannot  be  solved  in  time  less  than  8 n/s,  independent  of  the 
number  of  processors  used  because  of  its  sequential  nature. 

To  exploit  the  more  powerful  interconnection  features  of  the  grid  architecture,  we  must  use 
a  different  algorithm.  Lambiotte  [9]  suggested  using  Gaussian  elimination  in  one  direction  and 
odd-even  cyclic  reduction  in  the  other.  This,  however  will  still  require  a  linear  time  because  of  the 
Gaussian  elimination  in  one  direction.  More  recently,  Gannon  and  Van  Rosendale  [3]  have  proposed 
using  a  scheme  based  on  what  the  mechanical  and  structural  engineers  call  problem  substructuring 
or  condensation  techniques. 

There  are  many  substructuring  Gaussian  elimination  algorithms.  The  one  briefly  described 
next  is  due  to  Wang  [22].  It  is  a  variation  of  an  algorithm  first  presented  by  Kuck  and  Sameh 
[19,  10].  Advantages  of  Wang's  variant  are  its  low  arithmetic  complexity  with  almost  no  increase 
in  communication  complexity  and  its  improved  numerical  properties  [6].  The  algorithm  can  be 
briefly  described  as  follows.  Assume  that  we  want  to  solve  one  tridiagonal  system  Tx  —  b  of  size 
n  in  k  processors,  each  of  which  holds  h/k  successive  rows  of  the  system.  The  algorithm  consists 
of  three  phases.  In  the  first  phase  a  variant  of  Gaussian  elimination  is  performed  with  almost 
no  interprocessor  communication.  In  the  jlh  processor,  j  =  {1,2, ...,«},  a  forward  elimination  is 
carried  out  on  equations  (j  -  1)£  +  i,  »  =  {2,3....,  £},  in  order  to  eliminate  the  sub-diagonal 
elements  within  each  £  x  j  diagoual  block  of  T.  This  requires  no  interprocessor  communication 
and  all  the  processors  compute  in  parallel.  Then,  a  backward  elimination  is  carried  out  to  eliminate 
the  super-diagonal  elements  in  rows  j%  —  2  through  (j  -  1)  £  for  j  =  {2,3,  ...,/c}  and  in  rows  j  -  2 
to  1  for  j  s=  1.  After  a  small  amount  of  interprocessor  computation,  all  processors  compute  in 
parallel.  The  result  of  this  first  phase  is  illustrated  in  Figure  3  for  a  system  of  size  20  distributed  in 
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Figure  3:  A  substructuring  Gaussian  elimination  algorithm 
on  four  processors. 


4  processors.  Communication  of  three  nonzero  matrix  elements  and  one  right  hand  side  element, 
is  required  between  adjacent  processors,  leading  to  a  total  cost  of 


U 


,n  .12 

=  (--i)7+,+ 


4 

6’ 


for  the  first  phase. 

We  observe  that  the  unknowns  j '  —  {1, 2,  ...,k},  satisfy  an  independent  tridiagonal  system 

of  k  equations  distributed  with  one  equation  per  processor.  Clearly,  the  second  phase  will  consist 
in  solving  for  these  unknowns.  If  the  processors  form  a  ring  or  linear  array,  the  total  cost,  using  a 
regular  Gaussian  elimination  algorithm,  is  roughly 


/  , .  4  8 , 

(3  %  (k  —  1)(2 r  +  H — ) 
0  5 


In  the  third  and  final  phase,  each  processor  j  solves  for  the  other  variables  by  subtracting 
multiples  of  the  fill-in  columns.  This  operation  involves  no  communication,  except  for  the  transfer 
of  the  variables  j  £  already  computed  in  phase  2  from  processors  j  to  processors  j  +  1  for  j  — 
{l,2,...,/e  -  1}.  Therefore,  the  cost  of  the  third  phase  on  a  linear  array  or  ring  is  approximately 

5«  1 

*3  -  +  T  +  — . 

KS  0 

Going  back  to  implementing  the  Alternating  Direction  Method  on  our  grid  of  processors,  we 
observe  that  each  row  of  processors  can  be  considered  as  a  linear  array  of  k  =  %/k  processors. 


■'V-V-V 
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In  each  row  of  processors  we  will  solve  ti/k  tridiagonal  systems  by  the  substructuring  Gaussian 
elimination.  In  the  first  pass  of  the  Alternating  Direction  Method  all  the  rows  of  the  processor  grid 
will  work  simultaneously,  with  no  communication  between  the  processors  of  different  rows.  In  the 
second  pass  of  the  Alternating  Direction  Method  all  processors  of  the  same  column  of  the  processor 
grid  will  work  simultaneously  with  no  communication  between  different  columns. 

Implementing  the  first  phase  is  straighforward  and  its  cost  is 

,  n  12m  n  4  I2n2  4n 

t\  =  -. - +  r  +  -.-=  - —  +  r  +  — . 

K  KS  Kb  ks  Kb 

Implementing  the  second  phase  requires  a  little  more  care.  The  problem  is  that  we  have  ti/k 
tridiagonal  systems  each  of  size  k  and  with  one  row  per  processor.  Just  as  was  pointed  out  in 
our  first  implementation  of  the  Alternating  Direction  Method  for  the  grid  described  in  the  very 
beginning  of  this  section,  a  naive  implementation  consisting  in  sweeping  in  the  horizontal  direction 
for  all  variables  of  the  first  column  of  subsquares  before  forward-sweeping  in  the  second  column  of 
subsquares  and  so  on,  would  be  inefficient  because  only  one  processor  of  each  row  will  be  active  at 
any  time.  In  fact,  this  naive  approach  would  require  solving  h/k  tridiagonal  systems  of  size  k  each 
and  therefore  the  time  would  again  be  nearly  linear  in  «.  Once  again  the  remedy  is  to  use  pipelining. 
It  is  interesting  that  the  seemingly  minor  modification  of  pipelining  leads  to  a  completely  different 
complexity  analysis  and  a  different  conclusion. 

An  argument  similar  to  the  one  of  the  beginning  of  this  section  show's  that  in  order  to  complete 
the  forward  sweep  we  need  a  total  of  h/k  +  k  —  1  successive  steps  each  consisting  of  one  Gaussian 
elimination  of  one  row  of  a  tridiagonal  system  plus  a  transfer  of  the  pivot  row.  Clearly,  the  backward 
sweep  can  by  pipelined  likewise.  The  total  time  for  this  second  phase  is  approximately 
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Finally,  the  third  phase  is  fully  parallelizable  and  requires  a  time  of 
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The  total  as  a  function  of  k  comes  to  .  ^ccovion  For  j 
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The  important  point  is  that  it  is  now  possible  to  obtain  an  execution  tinge  of!  order ^less  thaiP 

O(n).  In  fact,  since  To(k)  is  of  the  form  j  ..  - -  .  . - 

n2  n  !  B 
Tc(k)  =  a~  +  i3-^=  +  +  Constant ,  D;  t  j . - 
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and  has  the  value 
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It  is  easy  to  show  by  employing  an  argument  similar  to  one  used  in  (14]  that  Tc,»  is  close  to  the 
actual  minimum  of  T<j(fc)  when  n  tends  to  infinity. 
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6.  Alternating  Direction  Methods  on  Hypercube  Architectures 

There  are  currently  several  existing  loosely  coupled  multiprocessors,  labeled  under  the  name 
hypercube,  which  are  based  on  the  binary  n-cube  networks  [1,  17,  16,  20].  An  m-dimensional 
hypercube,  or  m— cube,  consists  of  2m  nodes  that  are  numbered  by  m-bit  binary  numbers,  from 
0  to  2m  -  1  and  interconnected  so  that  there  is  a  link  between  two  processors  if  and  only  if  their 
binary  representation  differs  by  one  and  only  one  bit.  For  the  case  m  =  3,  the  8  nodes  can  be 
represented  as  the  vertices  of  a  three  dimensional  cube.  One  of  the  main  advantages  of  hypercubes 
is  that  many  of  the  classical  topologies  such  as  two-dimensional  or  three-dimensional  meshes  (8,  2, 
17,  20,  7]  can  be  imbedded  preserving  proximity  in  them.  An  immediate  consequence  of  this  is  that 
the  algorithms  presented  for  the  ring  and  the  grid  architectures  can  be  immediately  adapted  to  the 
hypercube.  Thus  one  step  of  the  Alternating  Direction  Method  can  be  run  on  a  hypercube  in  the 
same  time  as  on  a  mesh  configured  multiprocessor  with  the  same  number  of  processors,  i.e.,  in  time 
C>(n2/3)  for  a  sufficiently  large  dimension.  All  we  need  is  to  imbed  the  grid  into  the  hypercube  and 
then  use  the  substructuring  Gaussian  elimination  of  the  previous  section.  For  a  detailed  description 
on  how  to  map  a  grid  into  a  hypercube,  see  for  example  [17].  An  important  natural  question  is 
whether  or  not  there  is  another  algorithm  which  achieves  a  better  asymptotic  time  because  of  the 
more  powerful  topological  properties  of  the  hypercube.  Once  again  the  answer  is  yes. 

A  natural  candidate  for  solving  tridiagonal  systems  on  multiprocessors  is  the  cyclic  reduction 
algorithm  [4,  5].  Using  the  same  mapping  as  for  the  two-dimensional  grid,  i.e.,  partitioning  the 
grid  into  squares  of  size  2  x  2  and  assigning  the  square  in  position  (*,/)  to  processor  (t,j)  of 
a  two-dimensional  mesh  of  k  x  k  processors  embedded  in  the  hvpercube  by  the  proximity  pre¬ 
serving  embedding  described  above,  it  is  clear  that  each  of  the  solve  phases  in  the  Alternating 
Direction  Method  amounts  to  solving  in  each  row  or  column  of  the  conceptual  processor  mesh  n/n 
independant  tridiagonal  systems  each  of  which  is  split  into  k  equal  parts. 

It  is  important  to  note  that  when  using  the  cyclic  reduction  algorithm  the  distance  between 
the  rows  of  the  tridiagonal  system  will  increase  if  the  grid  points  are  not  carefully  assigned  to 
the  nodes.  With  a  mesh  embedding  based  on  a  binary-reflected  Gray  code  proximity  is  preserved 
throughout  the  reduction  process  [5].  For  one  tridiagonal  system  of  2m  equations,  equation  »  is 
mapped  into  the  node  whose  binary  label  is  <7,,  where  go~9u-  is  the  m-bit  binary-reflected 

Gray  code  [12].  It  can  be  easily  shown  that  g,  and  <7(,+2>)m<x/  2m  differ  in  exactly  two  bits,  for  all 
j> 0.  This  property  means  that,  with  this  Gray  code  assignment,  the  distance  between  consecutive 
rows  in  the  first  step  of  cyclic  reduction  is  one  and  is  exactly  two  in  the  subsequent  steps.  If  the 
tridiagonal  system  is  of  size  n  larger  than  k  =  2'".  then  substructuring  is  used  with  substructures  of 

2  equations  each  and  substructure  j,  j  =  {0. 1 . k  -  1},  assigned  to  processor  gj.  The  solution 

process  is  similar  to  the  substructured  Gaussian  elimination  described  previously,  except  that  cyclic 
reduction  is  used  for  phase  2.  The  times  for  phases  1  and  3  are  the  same  as  before.  The  time  for 
phase  2  is  [5] 

/17  ,  10\  „  ,,  „  5  1 

,1=^_+4r+Tj(/o9!K-1)-2r-i+-. 

Consider  now  the  problem  of  employing  this  technique  in  the  Alternating  Direction  Method. 
An  important  observation  is  that  each  column  (or  row)  of  processors  is  a  subcube  of  the  hypercube, 
on  which  work  can  be  done  independently.  Thus,  one  can  consider  using  the  above  mapping  based 
on  Gray  codes  on  each  of  the  subcubes  and  the  k  different  subcubes  will  solve  in  parallel  a  set  of 
n/K  tridiagonal  systems  each  of  size  n  and  spread  in  k  processors,  u/k  equations  per  processor. 
With  2  independant  tridiagonal  systems  to  be  solved  in  each  subcube  in  phase  2  the  following 
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estimate  is  derived  assuming  no  pipelining  of  communication  actions: 


[log2K  -  1)  -  2 t 


5/i  n 
6k  sk' 


If  the  two  successive  communications  in  a  reduction  step  can  be  pipelined,  then  the  term  ^  can 
be  reduced  to  |. 

By  comparing  the  time  for  phase  2  using  cyclic  reduction  on  a  hypercube  to  that  using  Gaus¬ 
sian  elimination  we  conclude  that  if  the  communication  start-up  time  is  high  with  respect  to  the 
elemental  transfer  time  1/b  and  with  respect  to  the  arithmetic  processing  time  l/s,  then  cyclic 
reduction  is  preferrable  to  Gaussian  elimination.  If  start-ups  can  be  ignored  then  Gaussian  elimi¬ 
nation  is  superior  when  n  is  large  relative  to  k  while  cyclic  reduction  is  superior  when  n  is  small. 
More  precisely,  let  us  assume  that  r  =  0,  and  define  a  =  6/s.  the  ratio  of  communication  to 
arithmetic  speeds.  Then,  by  comparing  the  expressions  for  t'2  for  Gaussian  elimination  and  cyclic 
reduction  respectively  and  by  neglecting  the  lower  order  terms,  we  find  that  the  break-even  point 
is  realized  when  n  is  approximately 


n(a) 


K2 
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What  is  interesting  is  that  the  two  fractional  coefficients  in  the  denominator  of  the  above  expression 
lie  in  the  intervals  [2.125,2.5]  and  [3,4.75]  respectively  for  all  values  of  a  between  zero  and  infinity. 
Therefore,  for  the  sake  of  simplicity  we  replace  the  above  expression  of  n(a)  by 

,  ,  k2  k 

n(a)  %  — - -  =  - - rr- 

2log2K  -  4  log 2  ^ 

Thus,  under  the  above  assumptions,  when  k  =  16,  (/;  =  256)  cyclic  reduction  is  faster  for  all  values 
of  n  not  exceeding  n  64.  When  k  =  64,  (k  =  4096)  the  break-even  occurs  at  n  512  while  for 
k  =  1024(k  =  220  as  1,048,576)  it  takes  place  at  n  65,536.  Of  course,  if  communication  start-ups 
were  to  be  included  then  the  competitiveness  of  cyclic  reduction  is  further  improved. 

It  should  be  noticed  that  with  the  above  simplistic  implementation  of  cyclic  reduction  the 
reduction  process  for  all  equations  converge  to  the  same  processor  and  this  leads  to  a  bottleneck. 
The  implementation  can  be  improved  by  making  the  reduction  process  for  different  equations 
converge  to  different  processors.  That  the  reduction  process  can  be  made  to  converge  to  an  arbitrary 
processor  is  easily  seen  if,  conceptually,  a  tridiagoual  system  is  extended  with  the  appropriate 
number  of  leading,  or  trailing  equations  with  a  corresponding  0  solution  (0  off-diagonal  elements 
and  right  hand  side).  For  k  tridiagonal  systems  solved  in  a  subcube  of  k  nodes  the  load  balancing 
Is  perfect,  and  each  node  performs  the  same  number  of  arithmetic  operations  in  the  reduction 
of  all  k  equations  as  is  required  for  the  reduction  of  a  single  equation.  Hence,  for  this  improved 
implementation  of  cyclic  reduction  we  arrive  at  the  following  estimate  for  the  time  of  one  step  of 
the  Alternating  Direction  Method 

.  ,  ,  /  5k  - /o< 72K  17k  -  18log2K  +  2\  r  n  . 

h  =  2(21o92k  -  l)r  +  (2. - - J 


The  number  of  start-ups  is  the  same  as  in  the  naive  implementation  of  the  cyclic  reduc¬ 
tion  method,  but  the  communication  and  arithmetic  bandwidth  requirements  are  reduced.  This 
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improved  implementation  of  cyclic  reduction  is  of  lower  complexity  than  Gaussian  elimination  if 
n  <  k2,  approximately,  ignoring  start-ups.  The  reason  for  the  advantage  of  Gaussian  elimination 
over  cyclic  reduction  for  £  >  k  is  due  to  its  lower  arithmetic  complexity.  The  solution  of  a  tridiag¬ 
onal  system  by  cyclic  reduction  requires  approximately  twice  the  number  of  arithmetic  operations 
of  Gaussian  elimination,  and  the  sequential  nature  of  Gaussian  elimination  is  outweighed  by  its 
arithmetic  efficiency  for  sufficiently  many  equations  per  node.  The  improved  implementation  of 
cyclic  reduction  is  competitive  for  n  <  64  for  k  =  8.  and  n  <  1024  for  k  =  32. 

Adding  the  complexities  for  all  three  phases  (/', ,  *2.  *3)  of  one  step  of  the  Alternating  Direction 
Method  yields 
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The  form  of  Tc{k)  is 


Tc( k)  =  a-. — |-  jl°(l2k  +  Constant , 
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and  its  minimum  is  achieved  for  k  =  n.  For  this  value  of  k 
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7.  Higher  Dimensional  Problems 

For  the  solution  of  higher  dimensional  problems  we  note  that  there  exist  no  proximity  pre¬ 
serving  embedding  of  a  m-dimensional  grid  in  an  n-dimensional  grid  for  m  >  n  [13],  Hence,  the 
order  of  the  computational  complexity  for  solving  a  two-dimensional  problem  on  a  linear  array 
cannot  be  reduced  below  O(n)  by  extending  the  linear  array  to  more  than  O(n)  nodes.  Solving  a 
m-dimensional  problem  with  nm  interior  grid  points,  m  >  2,  on  a  linear  array  requires  a  time  of 
order  0(nm_1). 

Similarly,  for  m  >  3  and  a  two-dimensional  array  of  processors,  the  minimum  time  is  of  order 
0(nm-2)  and  the  corresponding  number  of  processors  is  n2.  In  the  case  of  a  three-dimensional 
problem  one  simple  embedding  in  a  two-dimensional  array  of  processors  is  obtained  by  identifying 
all  points  in  one  dimension  and  using  the  two-dimensional  grid  embedding  described  previously. 
Two  of  the  three  computational  steps  of  the  Alternating  Direction  Method  for  each  time  step  axe 
performed  as  in  the  two-dimensional  grid  case  but  repeated  n  times  (for  the  third  dimension).  The 
third  step  is  local  to  each  processor  and  consists  in  solving  (^)2  tridiagonal  systems  of  n  equations 
each.  The  time  for  this  step  is 

8 n  /n\2 
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and  the  time  for  each  of  the  other  two  steps  is 
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assuming  that  the  reduced  system  is  solved  by  Gaussian  elimination.  It  is  clear  that  for  this 
embedding  k  —  n  minimizes  Tfc,  i  =  {1.2,3}. 


Higher  dimensional  meshes  can  be  embedded  in  sufficiently  large  hvpercubes  preserving  prox¬ 
imity  by  subdividing  the  cube  into  m  subcubes  for  a  m-dimensional  grid.  As  for  the  two-dimensional 
grid  the  m-dimensional  grid  is  partitioned  into  regions  and  the  regions  embedded  in  the  hypercube 
by  encoding  their  indices  in  a  binary-reflected  Gray  code.  With  a  m-dimensional  grid  partitioned 
uniformly  into  k  partitions  in  each  dimension,  the  grid  having  n  interior  points  in  each  dimension, 
and  the  subcubes  being  of  size  k ,  the  complexity  estimate  for  each  of  the  m  steps  of  the  Alternating 
Direction  Method  is 

Tc  (=)“  +  2(2Io9jk  -  l)r  +  +  HlC 

+  2r+6U) 


using  cyclic  reduction  for  the  solution  of  the  reduced  system.  The  minimum  is  the  same  as  in  the 
two-dimensional  grid  case,  namely 


Tq  *s  ^ 4r  +  2^  +  — ^  log2n. 


8.  Conclusion 

In  this  paper  we  have  proposed  several  implementations  of  the  Alternating  Direction  method 
on  multiprocessors.  Estimates  of  the  time  to  execute  one  half  step  of  the  algorithm  lead  us  to  the 
following  conclusions. 

•  The  shared  memory  model  and  broadcast  bus  model  do  not  allow  for  highly  efficient  imple¬ 
mentations  of  the  Alternating  Direction  Method. 

•  If  k  <  n  then  the  ring  architecture  is  sufficient  to  obtain  an  asymptotically  optimal  speed  up. 
The  optimal  time  is  of  order  0(nm-1)  for  a  m-dimensional  grid  of  0(nm)  interior  points.  The 
optimum  is  attained  for  n  processors. 

•  In  order  to  reduce  the  computational  complexity  further  it  is  necesary  to  use  an  array  of 
processors  of  a  dimensionality  closer  to  that  of  the  problem.  For  a  two-dimensional  grid  a 
two-dimensional  array  of  processors  suffices  to  achieve  a  complexity  less  than  O(n).  By  using 
a  substructuring  technique  and  solving  the  reduced  system  by  either  Gaussian  elimination  or 
cyclic  reduction  a  minimum  complexity  of  order  0(n 2/f3)  is  obtained.  The  number  of  processors 
for  which  the  minimum  is  attained  is  0(n4/3).  The  method  which  has  the  lowest  absolute 
complexity  is  dependent  on  the  ratio  of  the  communication  and  arithmetic  bandwidths  and  the 
ratio  2  [5],  For  a  m-dimensional  grid,  m  >  3.  the  minimum  time  for  a  two-dimensional  array 
of  processors  is  of  order  0(n”'~2)  and  the  corresponding  number  of  processors  0(n2). 

•  The  hypercube  architecture  makes  it  possible  to  exploit  very  large  scale  parallelism  in  the 
Alternating  Direction  Method.  One  step  of  ADI  on  an  an  m-dimensional  problem  can  be 
performed  in  time  proportional  to  alog2n  by  using  an  nm- node  hypercube.  This  time  is  of  the 
order  of  the  lower  bound. 

•  For  k  <  n,  i.e.,  when  the  number  of  processors  is  less  than  the  number  of  grid  points  in  one  di¬ 
mension,  the  lower  arithmetic  complexity  of  Gaussian  elimination  compared  to  cyclic  reduction 
may  outweigh  its  inherent  sequential  nature,  since  the  degree  of  parallelism  is  relatively  low. 
For  a  two-dimensional  problem  a  substructuring  algorithm  based  entirely  on  Gaussian  elimi¬ 
nation  may  be  of  a  lower  complexity  than  combining  local  Gaussian  elimination  with  “global" 
cyclic  reduction  if  2  >  k.  Consequently,  a  ring  or  a  grid  architecture  may  yield  a  complexity 
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comparable  to  that  of  a  hypercube  architecture.  For  higher  dimensional  problems  the  hyper¬ 
cube  architecture  always  yields  a  lower  asymptotic  complexity  than  a  ring  or  a  two-dimensional 
array  architecture.  The  break-even  point  between  Gaussian  elimination  and  cyclic  reduction 
occurs  approximately  at  n 
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